Stabilize diskless no-drop replication test#3511
Stabilize diskless no-drop replication test#3511zuiderkwast merged 7 commits intovalkey-io:unstablefrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3511 +/- ##
============================================
- Coverage 76.58% 76.38% -0.20%
============================================
Files 159 159
Lines 80019 80019
============================================
- Hits 61283 61125 -158
- Misses 18736 18894 +158 🚀 New features to boost your workflow:
|
|
Did this test fail recently, I know viktor attempted to fix it in this PR: #3461 Edit: I just saw the comments on the PR, looks like it did fail again recently. |
zuiderkwast
left a comment
There was a problem hiding this comment.
Looks pretty good. Only the "no replicas drop" case is covered, so the other cases can still have timing issues, such as "fast" and "slow" cases? I see there are some special cases for the other cases, for example for "timeout" there is pause_process as well. Do you have a full picture?
14afd0b to
5a24b23
Compare
|
@zuiderkwast 100 runs for all the variants passed: https://github.com/sarthakaggarwal97/valkey/actions/runs/24521848934 |
5a24b23 to
17730b5
Compare
|
@zuiderkwast I think it still deflakes the tests a lot, but I am afraid out of 500, I still see 1-5 flaky runs. |
56ae8dc to
1e467b0
Compare
|
This version is quite stable. Not seeing failures anymore. |
zuiderkwast
left a comment
There was a problem hiding this comment.
Very good! Sorry for the delay.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
The 'slow' subcase fallback was searching for a log message matching '*Connection with replica client id * lost.*' but the actual server log format is 'Connection with replica <host>:<port> lost.' — there is no 'client id' in the message. Fix the glob to '*Connection with replica * lost.*'. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
On fast ARM CI runners the RDB transfer completes before both replicas are killed, so the primary logs '2 replicas still up' instead of 'last replica dropped' or '1 replicas still up'. Add a nested catch fallback to accept all three possible outcomes. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
9a86d16 to
fe1d7e6
Compare
|
The test failures are unrelated to this change. Related to #2936 |
This deflakes all variants of `diskless replicas drop during rdb pipe`. The main issue turned out to be that the test was too sensitive to timing and log ordering under TLS, not that the core behavior was wrong. This keeps the same five subcases (no, slow, fast, all, timeout) but makes them much less CI-fragile. CI passes 200 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515 --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com> Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com> Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
This deflakes all variants of
diskless replicas drop during rdb pipe.The main issue turned out to be that the test was too sensitive to timing and log ordering under TLS, not that the core behavior was wrong. This keeps the same five subcases (no, slow, fast, all, timeout) but makes them much less CI-fragile.
CI passes 200 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515